### Demonstration Sentiment Analysis

This notebook is for training purposes, so it has been simplified to highlight the core steps in training an ML model.

### Step 1: Load, Clean, Prepare the data

In [None]:
import pandas as pd

df = pd.read_csv('https://s3.eu-west-1.amazonaws.com/neueda.conygre.com/pydata/ml_fc/Restaurant_Reviews.tsv', sep='\t')

print(df.shape)
df.head()

Note: to keep things simple, this preparation function does very little. Further steps could include:
* stemming / lemmatization
* removing stop words
* identifying the most semantically valuable words

In [None]:
import re

def prepare_review(review):
    # TODO: clean up each review
    return review
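One possible way to fill in the TODO is sketched here under a separate, hypothetical name (`prepare_review_sketch`) so it does not replace the exercise. It assumes that lower-casing and stripping everything except letters is enough cleaning for this demonstration.

```python
import re

def prepare_review_sketch(review):
    """Illustrative cleaning: lower-case and keep only letters and spaces."""
    review = review.lower()
    review = re.sub(r"[^a-z\s]", " ", review)   # replace punctuation and digits with spaces
    return re.sub(r"\s+", " ", review).strip()  # collapse runs of whitespace

example = prepare_review_sketch("Wow... Loved this place!!")
print(example)
```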

In [None]:
prepare_review('Not tasty and the texture was just nasty.')

In [None]:
# apply the prepare_review function to each review (df.apply)
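A minimal sketch of applying a cleaning function to every review. It uses a tiny stand-in DataFrame rather than the downloaded file, and the column names `Review` and `Liked` are assumptions about the dataset; the `prepare_review` body here is only a placeholder.

```python
import pandas as pd

# Tiny stand-in for the reviews DataFrame; the column names are assumed.
reviews = pd.DataFrame({
    "Review": ["Wow... Loved this place.", "Crust is not good."],
    "Liked": [1, 0],
})

def prepare_review(review):
    # Placeholder cleaning step: lower-case and trim whitespace.
    return review.lower().strip()

# Series.apply calls the function once per review and returns a new Series.
reviews["Review"] = reviews["Review"].apply(prepare_review)
print(reviews)
```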


### Encode the reviews as a "Bag of Words" (the CountVectorizer utility does this for us)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer



### Step 2: Split into independent and dependent variables

In [None]:
# remember to pass reviews through the count vectorizer first
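A sketch of this step on stand-in data: the vectorized review text becomes the independent variables and the label column becomes the dependent variable. The column names `Review` and `Liked` are assumptions about the dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in data; "Review" and "Liked" are assumed column names.
reviews = pd.DataFrame({
    "Review": ["great food", "terrible service", "great service"],
    "Liked": [1, 0, 1],
})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews["Review"]).toarray()  # independent variables
y = reviews["Liked"].to_numpy()                             # dependent variable

print(X.shape, y.shape)
```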


### Step 3: Split our data into training and test sets 

In [None]:
from sklearn.model_selection import train_test_split
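A minimal sketch of the split, using made-up arrays. The 80/20 ratio and `random_state=0` are illustrative choices, not values prescribed by this notebook, and the `_demo` names are used only to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features each
y_demo = np.array([0, 1] * 5)

# Hold back 20% of the rows for testing; random_state makes the split repeatable.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_train_demo.shape, X_test_demo.shape)
```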


### Step 4: Choose and train the model

In [None]:
from sklearn.linear_model import LogisticRegression
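A sketch of the training call on a made-up corpus. In the notebook itself the model would be fitted on the training split of the vectorized reviews; the texts and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up reviews: 1 = liked, 0 = not liked.
texts = ["great food", "loved it", "great service",
         "terrible food", "hated it", "terrible service"]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_demo = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X_demo, labels)   # learns one weight per vocabulary word

print(model.classes_, model.coef_.shape)
```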


#### Here we are demonstrating inference / prediction
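A sketch of inference on unseen text, again with an invented corpus. The point to note is that new reviews go through the already fitted vectorizer with `transform`, not `fit_transform`, so that the columns line up with what the model was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training reviews: 1 = liked, 0 = not liked.
texts = ["great food", "loved it", "great service",
         "terrible food", "hated it", "terrible service"]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

new_reviews = ["the food was great", "terrible"]
new_features = vectorizer.transform(new_reviews)   # reuse the learned vocabulary
predictions = model.predict(new_features)
print(predictions)
```

Words the vectorizer never saw during fitting (here "the" and "was") are simply ignored.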

### Step 5: Validate / Measure the model

Because this is a classification problem, we will use a confusion matrix. Further metrics, such as accuracy, precision, and recall, can be derived from it.
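As a sketch of those further calculations, the block below derives accuracy, precision, and recall from a confusion matrix. The labels and predictions are made up rather than produced by the model in this notebook.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

# For binary 0/1 labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
print(accuracy, precision, recall)
```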


In [None]:
from sklearn.metrics import confusion_matrix



### For reference, how do some other sklearn classification models perform?

In [None]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [None]:
MNB_classifier = MultinomialNB()
MNB_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, MNB_classifier.predict(X_test))
print("MNB_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


In [None]:
BNB_classifier = BernoulliNB()
BNB_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, BNB_classifier.predict(X_test))
print("BNB_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


In [None]:
LR_classifier = LogisticRegression()
LR_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, LR_classifier.predict(X_test))
print("LR_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


In [None]:
SGD_classifier = SGDClassifier()
SGD_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, SGD_classifier.predict(X_test))
print("SGD_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


In [None]:
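# Note: this is LinearSVC, not the kernel-based SVC, so name and label it accordingly
LinearSVC_classifier = LinearSVC()
LinearSVC_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, LinearSVC_classifier.predict(X_test))
print("LinearSVC_classifier accuracy:", (cmatrix[0][0] + cmatrix[1][1]) / cmatrix.sum())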


In [None]:
from sklearn.neural_network import MLPClassifier

# (20,) is a one-element tuple: a single hidden layer of 20 units; (20) is just the integer 20
MLP_classifier = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)
MLP_classifier.fit(X_train, y_train)

cmatrix = confusion_matrix(y_test, MLP_classifier.predict(X_test))
print("MLP_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())
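The cells above repeat the same fit / predict / score pattern, so they could be condensed into a loop. The sketch below shows one way to do that on a made-up corpus; it swaps in `accuracy_score`, which for these predictions gives the same value as the diagonal-over-total calculation used above. The texts, labels, and split parameters are all invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Made-up corpus standing in for the vectorized reviews.
texts = ["great food", "loved it", "great service", "tasty and fresh",
         "terrible food", "hated it", "terrible service", "bland and cold"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

X_demo = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, test_size=0.25, random_state=0, stratify=labels
)

models = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name} accuracy: {scores[name]:.2f}")
```

Because the demo corpus is tiny and repetitive, the scores it prints say nothing about how these models perform on the real reviews.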